of thought.


KEY  WORDS 


History  of  Statistics 
Robust  Estimation 
Order  Statistics 
Outliers
Weighted  Means 
Trimmed  Means 




SIMON NEWCOMB, PERCY DANIELL, AND THE HISTORY OF ROBUST
ESTIMATION 1885-1920.


by 

Stephen M. Stigler*

The  University  of  Wisconsin,  Madison 


0.  Introduction 


In  the  eighteenth  century,  the  word  "robust"  was  used  to  refer 
to  someone  who  was  strong,  yet  boisterous,  crude,  and  vulgar.  By  1953 
when  Box  first  gave  the  word  its  statistical  meaning,  the  evolution  of 
language  had  eliminated  the  negative  connotation:  robust  meant  simply 
strong,  hardy,  healthy.  The  subject  of  robust  inference,  just  like  the 
word "robust", has a long and varied history. It is the aim of this
present  study  to  examine  a  part  of  this  history  and  its  relationship  to 
current  work. 

The  scope  of  this  paper  will  be  rather  narrow  -  we  shall  only  be 
concerned  with  the  mathematical  background  and  development  of  robust 
estimation  up  to  1920.  Thus  we  shall  be  less  concerned  with  the  first 
appearances of estimators such as the median and trimmed mean than with
the first mathematical analyses of their behavior and properties. The
main  emphasis  will  be  on  the  period  1885-1920,  and  particular  attention 
will  be  given  to  work  which  is  not  widely  known,  yet  is  relevant  to 
modern  lines  of  thought.  Section  two  discusses  the  contributions  of 
Simon  Newcomb  to  robust  estimation,  and  to  the  use  of  normal  mixtures  as 
models  for  heavy-tailed  distributions;  section  three  is  concerned  with 
the  history  of  the  mathematical  analysis  of  order  statistics  in  relation 
to  robust  estimation,  with  due  attention  to  the  works  of  Laplace, 
Sheppard,  and  Percy  Daniell;  and  section  four  contains  some  brief  remarks 
on "M-estimators".

The  reader  may  be  as  surprised  as  the  author  was  to  find  to 
what  extent  priorities  in  these  areas  have  been  misassigned.  While  many 
other  points  will  be  touched  upon  in  the  paper,  our  major  findings  are 
as  follows:  Laplace  (1818)  and  Sheppard  (1899)  seem  to  have  been  the 
first  to  present  a  large  sample  theory  for  one  or  two  order  statistics. 
Simon  Newcomb  (1886)  provided  the  first  sound,  modern  approach  to  robust 
estimation,  including  the  first  use  of  mixtures  of  normal  densities  as 
representing  heavy-tailed  distributions.  Percy  Daniell  (1920)  should  be 
credited  with  the  first  mathematical  analysis  of  the  class  of  estimators 
which  are  linear  functions  of  order  statistics,  including  the  derivation 
of the optimal weighting functions for estimating scale and location
parameters (the so-called "ideal" linear estimators) and the first
mathematical treatment of the trimmed mean. Some of Newcomb's work has
been  commented  upon  recently  by  Huber  (1972),  but  much  of  the  remainder 
of  the  work  discussed  in  this  paper,  including  that  due  to  Edgeworth, 
Galton, Laplace, Sheppard, and Daniell, has been largely ignored in
recent  years. 

We  shall  begin  with  a  brief  overview  of  the  situation  prior  to  1885. 

1. The Situation before 1885.

Scientists  have  been  concerned  with  what  we  would  call  "robustness"  - 
sensitivity  of  procedures  to  departures  from  assumptions,  particularly  the 
assumption of normality - for as long as they have been employing well-defined
procedures, perhaps longer. For example, in the first published
work  on  least  squares,  Legendre  (1805)  explicitly  provided  for  the  rejection 
of outliers:

"If  among  these  errors  are  some  which  appear  too  large  to  be  admissible, 
then those equations which produced these errors will be rejected, as
coming  from  too  faulty  experiments,  and  the  unknowns  will  be  determined 
by  means  of  other  equations,  which  will  then  give  much  smaller  errors". 

Yet most of the early work in mathematical statistics was obsessed with
"proving" the method of least squares, either starting with the assumption
that the sample mean is the best estimate of the mean and deriving the
normal distribution, as Gauss did in his first proof in 1809, or starting
with the normal law or the central limit theorem, as did Laplace in 1812.
The first mathematical work on robust estimation seems to have been that of
Laplace (1818) on the distribution of the median. We shall defer a discussion
of Laplace's work until section three, where it will be considered with
later work on linear functions of order statistics.

The  next  statistical  problem  connected  with  robust  estimation  to 
receive  mathematical  treatment  was  the  rejection  of  outliers.  In  1852, 
the first proposal of a criterion for the determination of outliers was
published by Benjamin Peirce,* the Harvard mathematician-astronomer and
father of logician-philosopher C. S. Peirce. Peirce's paper and most others
on this subject are not really about robust estimation, as their authors
did not concern themselves with the properties of the resulting estimators;
rather, they implicitly assumed that after the outlier test was performed
the estimation could be done with no thought given to what had gone before,
nor what information might be lost. This narrowness of view did not go
unnoticed at the time. The first paper proposing an outlier criterion
(Peirce, 1852) was soon followed by the first paper criticizing the use
of outlier criteria (Airy, 1856). Airy, the Astronomer Royal, wrote:

"And  I  have,  not  without  surprize  to  myself,  been  led  to  think  that 
the  whole  theory  is  defective  in  its  foundation,  and  illusory  in  its 
results;  that  no  rule  for  the  exclusion  of  observations  can  be 
obtained  by  any  process  founded  purely  upon  a  consideration  of  the 
discordance  of  these  observations". 

A lively debate ensued, with the participants not always expressing themselves
with Airy's restraint.** For example, Glaisher (1872) wrote "Professor
Pierce's [sic] criterion for the rejection of doubtful observations seems
to me to be destitute of scientific precision".

* See Anscombe (1960) and Rider (1933) for historical surveys of outlier
techniques.

** At one point an exchange in print between the mathematician Glaisher
and the astronomer Stone became so heated that one of Glaisher's papers
was itself rejected by the Monthly Notices of the Royal Astronomical
Society due to the personal nature of his comments; see Glaisher (1874).

One of the more interesting papers of this time (and one of the
most unusual statistical papers of all time) appeared in the Report of
the Superintendent of the U.S. Coast Survey for 1870. It is by C. S. Peirce,
written while he was an Assistant to the Coast Survey (at the time his
father  was  Superintendent  of  the  Survey!).  In  the  paper  Peirce  presented 
the  then  standard  material  of  the  theory  of  errors,  but  in  the  language 
and  notation  which  he  had  developed  for  the  logic  of  relations,  for 
which  he  later  became  famous.  Thus  we  find,  regarding  averages, 

"Since  [m]  denotes  all  men,  we  may  naturally  write  to  denote 

what  all  men  become  when  that  factor  is  removed  which  makes  [m] 
refer  to  men  rather  than  to  anything  else;  that  is  to  say,  to  denote 
the number of men. We may write this for short [m] with heavy
brackets. Then t being a relative term ("a tooth of,") by [t1]
will  be  denoted  the  total  number  of  teeth  in  the  universe.  But 

CO  will  be  used  as  equivalent  to  or  the  average  number  of 

teeth  that  anything  has." 

Peirce  included  a  sensible  -  one  is  tempted  to  say  "logical"  -  defense  of 
his  father's  outlier  criterion  in  the  paper  (p.  210).  By  1885  a  number 
of  rejection  criteria  were  in  use,  often  only  by  the  proposer  and  his  employees. 

But  techniques  other  than  simply  "reject  outliers,  then  use  the 
sample  mean"  were  also  employed.  A  variety  of  weighted  means  had  been 
used prior to 1885. For example, in 1763 James Short (an English astronomer
and noted manufacturer of telescopes) had estimated the sun's parallax
based on observations of the transit of Venus of 1761 by averaging three
means: the sample mean, the mean of all observations with residuals less
than one second, and the mean of those with residuals less than half a
second. The median and the midrange had appeared even earlier
(Eisenhart (1971)).
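Short's combination of three means can be sketched in a few lines of Python, purely as an illustration. The sketch assumes that the residuals are measured from the plain sample mean and that the cut-offs are one second and half a second of arc; the function name, the parameter names, and the sample values are hypothetical and are not taken from Short's data.

```python
from statistics import mean

def short_estimate(observations, thresholds=(1.0, 0.5)):
    """Average of the plain mean and of the means of those observations
    whose residual from the plain mean is smaller than each threshold."""
    center = mean(observations)
    means = [center]
    for limit in thresholds:
        kept = [x for x in observations if abs(x - center) < limit]
        if kept:  # skip a threshold that would discard every observation
            means.append(mean(kept))
    return mean(means)

# Hypothetical parallax determinations (seconds of arc), one of them discordant.
example = [8.2, 8.5, 8.6, 8.7, 8.9, 10.5]
print(short_estimate(example))
```

Because the two restricted means ignore the discordant value, the combined estimate is pulled less far toward it than the plain sample mean is.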

By  the  last  half  of  the  nineteenth  century,  weighted  least 
squares  had  become  a  standard  topic  in  the  literature  of  the  theory  of 
errors,  and  it  was  a  frequent  practice  (at  least  in  astronomical 
investigations) to weight observations differently, depending upon the
statistician's (often subjective) estimate of the "probable error"* of
the observation. The estimate of the probable error was supposed to be
based  solely  on  external  evidence:  scientists  were  warned  of  the  possible 
biases  if  the  magnitude  of  the  observation  were  allowed  to  influence  its 
weight  (see  Jevons  (1874,  p.  450),  for  example),  but  it  is  doubtful  that 
this  advice  was  faithfully  adhered  to.  We  shall  discuss  the  use  of  these 
weighted  means  further  in  the  next  section,  in  connection  with  the 
contributions  of  Simon  Newcomb. 

Other estimators were proposed in this period. In particular,
De Morgan (1847, p. 456) had outlined a scheme for discounting the more
extreme observations. This method, more fully developed by Glaisher (1873),
amounted  to  starting  with  the  sample  mean,  then  assigning  different  probable 
errors  to  the  different  observations  based  on  the  value  of  the  likelihood 
function  at  those  observations,  and  iterating  this  process.  Glaisher's 
estimate  was  criticized  by  both  Stone  (1873)  and  Edgeworth  (1883),  who 
both  (independently)  proposed  an  alternative  based  on  looking  at  a  local 
maximum  of  the  likelihood  function  (without  assuming  equal  probable  errors). 
Edgeworth  later  became  disenchanted  with  this  alternative  (Edgeworth,  1887a). 

At  about  this  time,  Francis  Galton  was  making  much  use  of  the 
median  (Galton,  1875),  although  his  motivation  was  less  suspicion  of  the 
normal  distribution,  which  he  considered  a  good  representation  of  many 
real phenomena, than an appreciation of the simplicity, ease of calculation,
and ease of interpretation of the median. Also, various formulae for
index numbers were developed during this period; these included weighted
averages and geometric means, each designed for a specific purpose.

* The probable error of a symmetric distribution is half the interquartile
range; for normal distributions p.e. = (.6745)σ.

However,  it  can  still  be  said  that  by  1885,  the  conventional 
wisdom  (but  by  no  means  the  unanimous  view)  was  that  for  purposes  of  estimation, 
the  cautious  use  of  the  sample  mean  was  recommended  -  sometimes  weighted, 
sometimes  after  discarding  outliers,  but  still  the  sample  mean. 

2.  Simon  Newcomb  and  mixtures  of  normal  densities 


1885  can  be  conveniently  taken  as  the  start  of  one  of  the  most 
active  and  innovative  periods  in  the  history  of  mathematical  statistics. 

The  story  of  the  development  of  mathematical  statistics  into  a  subject 
in its own right through the work of such men as Edgeworth, Karl Pearson,
Gosset, and Fisher has been told by E. S. Pearson (1967). Our present,
rather narrow purpose is to describe how the modern theory of robust estimation
developed over this period. To this end, we shall place particular
emphasis  on  the  introduction  of  mixtures  as  models  for  the  heavy-tailed 
distributions  which  scientists  had  encountered  in  practice,  and  on  the 
use  of  linear  functions  of  order  statistics  as  robust  estimators  of 
location  parameters. 

Simon  Newcomb  appears  to  have  been  the  first  to  introduce  a  mixture 
of  normal  densities  as  a  model  for  a  heavy-tailed  distribution,  and  to 
exploit  this  model  to  get  an  estimator  of  location  which  was  more  robust 
than the sample mean. (Francis Galton and Karl Pearson had modeled measurements
of natural populations by normal mixtures about the same time, but
with a completely different object in mind, namely to demonstrate how a
single  population  could  be  broken  down  into  components.)  While  Newcomb's 
name  may  be  unfamiliar  to  present  day  statisticians,  it  should  not  be  so 
to  astronomers,  applied  mathematicians,  and  economists. 

Simon  Newcomb  (1835-1909)  was  born  in  Nova  Scotia,  attended 
Harvard,  and  spent  most  of  his  adult  life  (1861-1897)  as  a  professor  of 
mathematics  in  the  U.S.  Navy,  working  for  the  U.S.  Nautical  Almanac 
Office.  He  is  generally  regarded  as  the  greatest  American  astronomer 
of the nineteenth century, and was responsible for many of the determinations
of astronomical constants which are still accepted today. In addition,
he  was  a  powerful  applied  mathematician,  co-founded  and  for  many  years 
edited  the  American  Journal  of  Mathematics,  and  as  an  avocation  wrote 
Principles  of  Political  Economy  (1885),  a  book  which  has  established  him 
as  a  major  American  economic  theorist,  and  which  contains  one  of  the 
earliest  modern  mathematical  statements  of  the  quantity  theory  of  money. 

As was the practice in astronomy at the time, Newcomb made
frequent use of weighted means in his estimation of astronomical constants.
The relative weights were usually thought of in terms of "probable errors",
and were assigned somewhat subjectively on the basis of Newcomb's judgment
of the relative accuracy of the process which produced the observation.
For example, after assessing some data on eclipses collected by Ptolemy
in the second century A.D., he remarked (Newcomb, 1878, p. 41):

"the  [assigned]  probable  errors  are  the  result  of  judgment  from  the 
terms  of  [Ptolemy's]  description  rather  than  of  calculation;  they 
were  estimated  without  any  knowledge  of  the  way  the  comparison  with 
theory  would  come  out,  and  are  printed  without  subsequent  alteration". 

With  more  contemporary  data,  Newcomb  would  base  his  choice  of  weights 
upon "the quality of the image and the generally satisfactory way in
which the image was kept on the crosswires" (Newcomb, 1891a, p. 170) in the
case of an experiment he was personally involved with, and upon the
number of observers, general opinion of the reporting observatory, and
"number and force of the doubtful circumstances" (Newcomb, 1891b, p. 383),
in cases involving combination of others' measurements. He was apparently
aware of criticism of the subjective nature of these assignments, but
he  maintained  that 

"Opinions  may  doubtless  differ  as  to  whether  a  judicious  system  of 
weights  has  always  been  applied,  but  it  is  not  likely  that  any  unbiased 
reassignment would materially affect the result". (Newcomb, 1898,
p. 211)

Newcomb  also  rejected  outliers  when  necessary,  but  usually  only  based  on 
external  evidence  or  really  huge  deviations. 

With this experience in dealing with observations made with
differing degrees of precision, it is not surprising that, when faced with
a collection of non-normal observations for which there was no satisfactory
way to weight them individually, he should consider a mixture of normal
densities with different variances as a model. For, having observed that
a collection of 684 residuals based on observations of the transits of
Mercury had much heavier tails than the corresponding normal distribution
(even with excessive deviations ignored), he wrote (Newcomb, 1882, p. 382):

"It  is  evident  that  if  we  have  a  collection  of  observations  of 
different  degrees  of  probable  error,  in  which,  however,  there  is  no 
way  of  distinguishing  those  of  great  probable  error  from  those  of 
small  probable  error,  the  law  of  the  errors  will  not  be  that  usually 
adopted, but there will be a comparative excess of large residuals.
It is also evident that in such a case the arithmetical mean does
not necessarily give the most probable result. For, in the case of an
observation of large residual, there is evidently a preponderance
of probability that it belongs to a class with large probable error,
and therefore should be assigned least weight. ... That any general
collection of observations of transits of Mercury must be a mixture
of observations with different probable errors was made evident to
the writer by his observations of the transit of May 6, 1878, which
may be here described as an illustration of the subject".

Four years after writing this, Newcomb published a remarkable
paper in his own journal, the American Journal of Mathematics, in which
he used this model to arrive at a more robust estimator of location than
the sample mean. In this paper (Newcomb, 1886), after criticizing the
overuse of outlier criteria and presenting his mixture model, he proceeded
to develop an estimator upon the principles of Bayesian decision theory that
gave "less weight to the more discordant observations". Adopting squared
error as a loss function (Newcomb's word for loss was "evil"), he demonstrated
that in general the posterior mean minimizes the expected mean square error,
and he suggested the following procedure. 1) Calculate the residuals
based on the sample mean, and, using trial and error, fit a mixture of a
finite number of normal densities with zero means to these residuals.
2) Take this fitted mixture and, considering the location family it
generates, estimate the desired mean by the posterior mean with respect
to a uniform prior given the original observations. Newcomb realized that
this procedure presented practical difficulties and gave a number of
simplifying approximations to arrive at a usable estimator. He illustrated
its use with the data on the transits of Mercury.
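The second step of this procedure can be sketched numerically. The sketch below is a modern illustration, not Newcomb's own computation. It assumes that the zero-mean normal mixture has already been fitted in step 1 (the mixing weights and standard deviations are simply supplied by the user), and it approximates the posterior mean under a uniform prior by a sum over a grid of candidate locations. All function and parameter names, the grid size, and the padding of three times the largest standard deviation are hypothetical choices.

```python
import math

def mixture_pdf(x, weights, sigmas):
    """Density at x of a zero-mean mixture of normal densities."""
    return sum(
        w * math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, s in zip(weights, sigmas)
    )

def posterior_mean_location(observations, weights, sigmas, grid_size=4001):
    """Posterior mean of the location parameter under a uniform prior,
    approximated on an evenly spaced grid of candidate values."""
    pad = 3.0 * max(sigmas)
    lo = min(observations) - pad
    hi = max(observations) + pad
    step = (hi - lo) / (grid_size - 1)
    thetas = [lo + k * step for k in range(grid_size)]
    # Log-likelihood of each candidate location; the floor avoids log(0).
    loglik = [
        sum(math.log(max(mixture_pdf(x - th, weights, sigmas), 1e-300))
            for x in observations)
        for th in thetas
    ]
    top = max(loglik)
    post = [math.exp(v - top) for v in loglik]  # unnormalized posterior
    return sum(th * p for th, p in zip(thetas, post)) / sum(post)
```

With a mixture that places a small weight on a component of large variance, a single discordant observation moves this estimate far less than it moves the sample mean, which is the behavior Newcomb sought.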


Some of his arguments also appear in Newcomb (1895), p. 81-86.

Ogorodnikoff (1928) provided a different simplification of Newcomb's
estimator based on a Charlier expansion of the posterior distribution.
The relationship between Newcomb's simplified estimator and the maximum
likelihood estimator was discussed by Hulme and Symms (1939).


As an interesting sidelight, we note that in this paper and in
a later work, Newcomb made an early use of a simple version of Tukey's
sensitivity function (see Andrews et al., 1972, p. 90). In Newcomb
(1912, p. 212), discussing the unsatisfactory nature of outlier criteria,
he wrote that if all observations with large residuals are rejected (and
the mean estimated from the remaining observations), then the final result

"becomes a discontinuous function of the residual of the rejected
observation, the continuity being broken at the point regarded as
the limit of normal error. A simple example will make the case
clear. If we have three observed results a, b, c of which the mean
is to be taken, and if c be the result which may be abnormal, then
so long as c is retained we shall have

$\mathrm{mean} = \tfrac{1}{3}(a + b + c)$;

the mean will then continuously increase with c. When c passes
the normal limit, the mean changes per saltum to

$\tfrac{1}{2}(a + b)$".

In the same posthumous paper (Newcomb, 1912, p. 214), he also proposed a very
simple estimator in the spirit of his 1886 paper: weight the observation
$X_i$ by $w_i = c/\max(|X_i - \bar{X}|, c)$, where $c$ is a constant to be specified.
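A minimal sketch of this weighting rule follows. It assumes that the residuals are taken about the sample mean, that the weights are applied once (no iteration), and that the estimate is the resulting weighted mean; the function name and the numerical values are hypothetical.

```python
from statistics import mean

def newcomb_weighted_mean(observations, c):
    """Weighted mean with weights c / max(|x - center|, c), c > 0, where
    the center is the sample mean; weights equal one for small residuals."""
    center = mean(observations)
    weights = [c / max(abs(x - center), c) for x in observations]
    return sum(w * x for w, x in zip(weights, observations)) / sum(weights)

# Hypothetical data with one discordant value.
print(newcomb_weighted_mean([1.0, 2.0, 3.0, 4.0, 100.0], c=2.0))
```

Unlike outright rejection, the weight falls off continuously as the residual grows, so the estimate has no jump of the kind described in the quoted passage. When c exceeds every absolute residual, all weights equal one and the sample mean is recovered.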

3. Laplace, Sheppard, Daniell, and linear functions of order statistics

With  few  exceptions,  statisticians  were  quite  late  in  coming  to 
consider  any  but  the  simplest  linear  functions  of  order  statistics  as 
estimators of means. By a linear function of order statistics we shall
mean  any  weighted  linear  combination  of  observations  where  the  weights 
depend  only  on  their  order,  not  on  their  magnitudes  or  the  size  of  their 


residuals. The median and the midrange, two members of this class, evidently
have a long history (Eisenhart (1964), (1971)), but perhaps the first
extensive mathematical analysis to be published involving order statistics
was by Laplace. In the second supplement (1818) to his monumental
Théorie Analytique des Probabilités, Laplace considered the problem we
would now call linear regression through the origin: $a_i = p_i y + x_i$,
$a_i$, $p_i$ known, $y$ to be estimated, where the errors $x_i$ were assumed to
have an arbitrary continuous, symmetric distribution. By looking for
that estimator which minimized the sum of the absolute values of the
residuals, he was led to consider an estimator of $y$ which reduces to
the median of the $a_i$'s in the case $p_i = 1$. Laplace derived the density
of this estimator, showed that this density approaches the normal density
as the sample size increases, and gave the necessary and sufficient
condition on the error distribution that the median have a smaller
asymptotic variance than the sample mean. Laplace's proof is easily
adapted to any sample percentile and asymmetrical populations, as was in
fact later noted by Edgeworth (1885, 1886). In addition, Laplace derived
the joint asymptotic density of the sample mean and median, and used it
to find which linear combination of these estimators has the smallest
asymptotic variance. (As the weights depend upon the unknown error
distribution, he termed this result "impracticable", but noted that if the
error distribution were normal,* the best linear combination was the sample
mean alone.) Two years before Laplace's investigation, Gauss (1816),
considering the problem of estimating the probable error of a normal
distribution, had suggested the use of the median of the absolute
values of the residuals, and stated (without proof) the asymptotic probable
error of the median for this special case. Gauss apparently never
published or circulated a proof, for 18 years later Encke (1834), who
had corresponded with Gauss, felt it necessary to provide one, attributing
it to Dirichlet. It seems likely that Dirichlet's proof for this special
case was simply an adaptation of Laplace's, as Dirichlet was quite
familiar with Laplace's work, the second supplement in particular (see
Dirichlet (1830)).

* Laplace actually carried through his entire investigation in the more
general regression situation, comparing the general estimator with the
least squares estimator for this situation. For other views of Laplace's
work and its historical context, see Eisenhart (1961) and Stigler (1972).
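The comparison of the median with the sample mean can be illustrated with a short computation. The sketch below uses the usual modern statement of the large-sample result, which is not quoted from Laplace: for a symmetric error density $f$ with variance $\sigma^2$, the asymptotic variance of the median is $1/(4nf(0)^2)$ and that of the mean is $\sigma^2/n$, so the median is preferable when $1/(4f(0)^2) < \sigma^2$. The function names and the particular mixture are hypothetical.

```python
import math

def scale_mixture_summary(weights, sigmas):
    """Variance and density at zero of a zero-mean mixture of normals."""
    variance = sum(w * s * s for w, s in zip(weights, sigmas))
    density_at_zero = sum(
        w / (s * math.sqrt(2.0 * math.pi)) for w, s in zip(weights, sigmas)
    )
    return variance, density_at_zero

def median_beats_mean(variance, density_at_center):
    """True when the median has the smaller asymptotic variance."""
    return 1.0 / (4.0 * density_at_center ** 2) < variance

# A single normal density: the mean is the better estimator.
print(median_beats_mean(*scale_mixture_summary([1.0], [1.0])))
# A heavy-tailed mixture of two normals: the median is the better estimator.
print(median_beats_mean(*scale_mixture_summary([0.9, 0.1], [1.0, 5.0])))
```

For the normal density the ratio of the two asymptotic variances is $\pi/2$ in favor of the mean, while a modest admixture of a component with large variance is enough to reverse the ordering.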

Later in the nineteenth century, Galton (1875) and particularly
Edgeworth (1885, 1887b, 1888) touted the use of the median in situations
where heavier tails than the normal could be expected. Specifically,
Edgeworth (1888) used Laplace's results to conclude that the median may
well be better than the mean when the population distribution is one of
Newcomb's mixtures of normal distributions. Also, Edgeworth (1886) seems
to have been the first to realize that the median may possess an
advantage over the sample mean for serially correlated data.

The possibility of a linear function of two estimators outperforming both
has been more fully exploited in the recent Princeton robustness study
(see Andrews et al. (1972), p. 132).

More complicated linear estimators began to appear in 1889,
when Galton (in a footnote on p. 61-62 of Natural Inheritance) suggested
estimating the mean and standard deviation of a normal distribution by
what amounts to taking

$\hat{\mu} = \dfrac{\xi_q X^{(p)} - \xi_p X^{(q)}}{\xi_q - \xi_p}, \qquad \hat{\sigma} = \dfrac{X^{(q)} - X^{(p)}}{\xi_q - \xi_p},$

where $\xi_p$ and $\xi_q$ are the p and q percentiles of the standard normal
distribution, $X^{(p)}$ and $X^{(q)}$ are the sample p and q percentiles,
and p and q are arbitrary but fixed (0 < p < q < 1). In 1899, in a
long paper on the multivariate normal distribution and its applications,
Sheppard proved the joint asymptotic normality of Galton's estimators
when the population is normal. He also showed the joint asymptotic
normality of $X^{(p)}$ and $X^{(q)}$, and gave analogues to $\hat{\mu}$ and $\hat{\sigma}$ based
on any finite number of sample percentiles (Sheppard, 1899, p. 131-132).
Sheppard's (sketchy) proof, which is based on an implicit use of the
probability integral transformation, can be easily adapted to any regular
distribution.*

* Twenty years later, Karl Pearson (1920) presented part of Sheppard's
proof in more detail, made the obvious step to more general distributions
than the normal, and much more fully examined the consequences of the result.

Sheppard's paper also represented the first attempt since Laplace
to optimize performance within a class of linear functions of order
statistics. He both showed how the best choice (for normal populations)
of p and q can be made (1899, p. 135) and found which linear
combination of the three quartiles has the smallest asymptotic variance
(again for normal populations) (1899, footnote, p. 134). Such functions
of the three quartiles had been considered earlier by Edgeworth (1893),
who neglected the quartiles' correlation and erroneously claimed the
estimator with weights in proportions 5:7:5 to be superior to the sample
mean for normal populations. Recent work, however, seems to bear out
Edgeworth's claim that such an estimator is to be recommended on grounds
of robustness. (See Gastwirth (1966) and Andrews et al. (1972), for
example.)
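Galton's idea of estimating the mean and standard deviation of a normal distribution from two sample percentiles can be sketched as follows. The estimates come from solving the two equations that set each sample percentile equal to the mean plus the standard deviation times the corresponding standard normal percentile. The interpolation rule used for the sample percentiles, the default choice of the quartiles, and all names are assumptions made for the illustration.

```python
from statistics import NormalDist

def sample_quantile(values, p):
    """Sample p-quantile by linear interpolation between order statistics
    (one of several conventions in use; chosen here for simplicity)."""
    xs = sorted(values)
    h = p * (len(xs) - 1)
    i = int(h)
    if i + 1 >= len(xs):
        return xs[-1]
    return xs[i] + (h - i) * (xs[i + 1] - xs[i])

def galton_estimates(values, p=0.25, q=0.75):
    """Estimates of the mean and standard deviation of a normal
    population from the sample p and q percentiles, 0 < p < q < 1."""
    z_p = NormalDist().inv_cdf(p)  # standard normal p percentile
    z_q = NormalDist().inv_cdf(q)  # standard normal q percentile
    x_p = sample_quantile(values, p)
    x_q = sample_quantile(values, q)
    sigma_hat = (x_q - x_p) / (z_q - z_p)
    mu_hat = (z_q * x_p - z_p * x_q) / (z_q - z_p)
    return mu_hat, sigma_hat
```

With p = 1/4 and q = 3/4 the location estimate reduces to the average of the two quartiles, and the scale estimate to the interquartile range divided by about 1.349.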

The next mathematical work to appear on order statistics was
Karl Pearson's (1902) examination of the Galton difference problem. In
this paper, which was inspired by an inquiry of Galton's (1902) as to the
most suitable proportion between the values of first and second prizes,
Pearson gave the joint density of any two consecutive order statistics
and found their expected difference. He remarked in a footnote that
"I propose on another occasion to consider the application of Galton's
problem to a new theory for the rejection of outlying individuals".
This proposal was later carried out by J. O. Irwin (1925).

In 1920, a remarkable paper appeared in the American Journal of
Mathematics (the journal Simon Newcomb co-founded) by the English mathematician
P. J. Daniell. This paper, "Observations weighted according to
order", has been all but totally overlooked since its publication. It
could in fact be claimed that Daniell was at least thirty years ahead of
his time, for it took that long for his major results to be rediscovered.
While his paper itself is a model of clarity and rigor, its relevance to
modern work is such that it merits a short summary, in his own notation.

The work was apparently inspired by a reading of Poincaré's Calcul
des Probabilités (1912). After remarking how Poincaré had suggested discarding
extreme observations (when normality is suspect) before taking the
mean, Daniell wrote:

"Besides such a discard-average [i.e., the trimmed mean] we might
invent others in which weights might be assigned to the measures
according  to  their  order.  In  fact  the  ordinary  average  or  mean,  the 
median,  the  discard-average,  the  numerical  deviation  (from  the  median, 
which  makes  it  a  minimum),  and  the  quartile  deviation  can  all  be 
regarded  as  calculated  by  a  process  in  which  the  measures  are  multiplied 
by  factors  which  are  functions  of  order.  It  is  the  general  purpose 
of  this  paper  to  obtain  a  formula  for  the  mean  square  deviation  of 
any  such  expression.  This  formula  may  then  be  used  to  measure  the 
relative  accuracies  of  all  such  expressions". 

Daniell's analysis proceeded as follows: First he explicitly
introduced the probability integral transformation (apparently the first
time this was done*) and explained how it can be used to find the moments
of any function of order statistics. Then, he assumed the population
density $p(t)$ was regular (and indefinitely differentiable), and he
expanded the inverse of the distribution function in a Taylor series to
derive asymptotic expressions for the mean of an order statistic $t_r$ and
the mean product of any two. He thus duplicated some of Sheppard's (1899)
results, but in a much more rigorous manner.

Daniel  1  then  considered  statistics  of  the  form  t  =  £ 

t  h 

where  he  assumed  that  the  weight  f  associated  with  the  rtn 
statistic  t  was  given  by 


n 


r=l 
order 


frV 


^nir1  ■ 


★ 


The  next  being  Karl  Pearson  (1931). 
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and  put  things  together  to  obtain  the  (now  standard  )  expression  for  the 
asymptotic variance of $t$,

$$S^2 = \int_{-\infty}^{\infty} \Phi^2(t)\, p(t)\, dt,$$

where $\Phi(t)$ is the indefinite integral of $f(x(t))$, $x(t) = \int_{-\infty}^{t} p(u)\, du$. If
he was less than specific as to why the remainder terms are uniformly
negligible, his standard of rigor was nonetheless far above that of the
statistical literature of the time.
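
As a rough numerical check of this expression, the sketch below evaluates S^2 by trapezoidal quadrature for a standard normal population. It reads the formula as the integral of Phi(t)^2 p(t) dt and assumes that the additive constant in Phi is fixed so that the integral of Phi(t) p(t) dt is zero; the summary above does not state that condition, so it is an assumption here. The function names, integration limits, and grid size are illustrative choices and are not taken from Daniell's paper.

```python
import math


def normal_pdf(t):
    """Standard normal density p(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def normal_cdf(t):
    """Standard normal distribution function x(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def asymptotic_variance(weight_fn, pdf=normal_pdf, cdf=normal_cdf,
                        lo=-10.0, hi=10.0, steps=20000):
    """Approximate S^2 = integral of Phi(t)^2 p(t) dt on a uniform grid.

    Phi is built as a running integral of weight_fn(cdf(t)) and then
    centred so that the integral of Phi(t) p(t) dt is zero (assumption).
    """
    h = (hi - lo) / steps
    ts = [lo + i * h for i in range(steps + 1)]
    g = [weight_fn(cdf(t)) for t in ts]
    p = [pdf(t) for t in ts]

    # Running trapezoidal integral: Phi(t_i) up to an additive constant.
    phi = [0.0]
    for i in range(1, steps + 1):
        phi.append(phi[-1] + 0.5 * h * (g[i - 1] + g[i]))

    def trapezoid(values):
        return h * (sum(values) - 0.5 * (values[0] + values[-1]))

    centre = trapezoid([a * b for a, b in zip(phi, p)])
    return trapezoid([(a - centre) ** 2 * b for a, b in zip(phi, p)])


def mean_weight(u):
    """Weight function of the ordinary average: f = 1 everywhere."""
    return 1.0


def trimmed_weight(u, alpha=0.25):
    """Weight function of a discard-average dropping a fraction alpha per tail."""
    return 1.0 / (1.0 - 2.0 * alpha) if alpha < u < 1.0 - alpha else 0.0


s2_mean = asymptotic_variance(mean_weight)
s2_trimmed = asymptotic_variance(trimmed_weight)
accuracy = s2_mean / s2_trimmed  # variance of the mean relative to the trimmed mean
```

Under these assumptions the ordinary average gives S^2 close to 1 and the quartile-discard average gives about 1.2, so for exactly normal data the discard-average is somewhat less accurate than the mean.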

In the third section of the paper, Daniell gave the conditions
on $f$ under which the asymptotic mean of $t$ is the population mean or
standard deviation, and defined the "accuracy" of $t$ as the ratio of the
asymptotic variance of the sample mean (or sample standard deviation,
as the case may be) to that of $t$. (He also derived the asymptotic variance
of the sample standard deviation here.) In the fourth section, Daniell
gave the optimal weight function $f$ (that which minimizes $S^2$) for both
the location and scale cases, using standard results from the calculus of
variations, and noted that the optimal estimate of $\sigma$ for the normal case
is as accurate as the sample standard deviation in this case. These results
were not to appear in print again until Jung (1955), although they are in
Bennett's (1952) unpublished thesis.
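
To make the definition of accuracy concrete, here is a standard worked case that is not taken from Daniell's paper: the sample median for a normal population with standard deviation $\sigma$. The symbols $m$ (population median), $\bar{x}$ (sample mean), and $\tilde{x}$ (sample median) are introduced only for this illustration, and the variances are scaled by the sample size $n$.

```latex
\[
  n\,\operatorname{Var}(\bar{x}) \to \sigma^2,
  \qquad
  n\,\operatorname{Var}(\tilde{x}) \to \frac{1}{4\,p(m)^2} = \frac{\pi\sigma^2}{2},
\]
\[
  \text{accuracy of the median}
  = \frac{\sigma^2}{\pi\sigma^2/2}
  = \frac{2}{\pi} \approx 0.64 .
\]
```

So under exact normality the median uses the data less efficiently than the mean, which is the kind of comparison the accuracy measure expresses.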

* See Chernoff, Gastwirth, and Johns (1967).

The final two sections were concerned with applications. Daniell
gave special attention to the "discard-average" (the trimmed mean),
presenting the (now standard) expression for its asymptotic variance and
evaluating its performance for various Pearson densities, including
Student's t. He also gave conditions under which the quartile-discard
average is superior to the sample mean. The paper ended with a number
of applications to other estimators of location and scale*, with numerical
results. Daniell did not derive the asymptotic normality of $t$, nor did
he try to state minimal regularity conditions (indeed, some of his regularity
conditions were implied rather than stated). However, taken all together
it is a thoroughly modern paper which almost appears to have been gleaned
from the literature of the 1950's and 1960's.
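
For readers who want the discard-average in concrete form, a minimal sketch follows. The function name, the rule for how many observations to discard from each end, and the sample values are all assumptions made for illustration; Daniell's own treatment is asymptotic and does not depend on them.

```python
def discard_average(values, fraction):
    """Mean of the observations left after discarding `fraction` from each end.

    `fraction` must satisfy 0 <= fraction < 0.5. The number discarded per
    tail is floor(fraction * n), which is an illustrative convention.
    """
    if not 0.0 <= fraction < 0.5:
        raise ValueError("fraction must lie in [0, 0.5)")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("values must not be empty")
    drop = int(fraction * n)          # observations removed from each tail
    kept = ordered[drop:n - drop]     # central block of order statistics
    return sum(kept) / len(kept)


# Hypothetical measurements with one wild value.
measures = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
ordinary_average = sum(measures) / len(measures)
quartile_discard_average = discard_average(measures, 0.25)
```

With these made-up values the single wild observation pulls the ordinary average well above the bulk of the data, while the quartile-discard average stays in the middle of it.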

How could such a paper have gone unnoticed for all these years?**
To see why, we need to learn something of Daniell's life. Percy John Daniell
(1889-1946) received a B.A. degree at Cambridge in 1910 (and an M.A. in
1914), where his honors included Senior Wrangler in Mathematics (1909),
First Class Physics Tripos (1910), and the Rayleigh Prize (1912). His stay
at Cambridge would have overlapped R. A. Fisher's, but they were at
different colleges and may not have met. After graduation (and brief stays
at Göttingen and Liverpool), Daniell went to Rice Institute in Houston,
Texas in 1912 as a travelling fellow. He remained at Rice until 1923,
becoming a full professor in 1920. It was at Rice he did his most important
work, principally on the theory of integration (including the development
of what is now known as the Daniell integral). In 1924 he returned to
England to the University of Sheffield, where he remained until his death
at the age of 57. In the latter part of his life he published occasional
papers on applied mathematics, on such topics as flame motion, potentials,
and quadrature formulae.

* Including the "discard-deviation", where the inner quartiles are
discarded.

** A fairly complete review of the literature reveals only two published
citations, Dodd (1922) and Greenberg (1968), and the descriptions there
are superficial and misleading. Daniell's paper came to my attention as
the result of a systematic inspection of the American Journal of Mathematics.

The paper, Daniell (1920), written at Rice, seems to have been
his only related work in statistics. This fact, together with his isolation
from active statistical research (both at Rice and Sheffield), was largely
responsible for the obscurity of the paper. Daniell's death before his
results were rediscovered and widely discussed, and Wilks' overlooking
his work in the survey paper of 1948, also served to delay recognition of
his priority. As a further irony, these circumstances have helped relegate
to obscurity another important paper of Daniell's, "Integral products and
probability" (1921), in which he presents one of the earliest mathematical
treatments of continuous-time Markov processes, including the
Chapman-Kolmogorov equation (ten years before Kolmogorov) and a short treatment
of the Wiener process (two years before Wiener).

4.  M-estimators 


Recently, much attention has been given to a class of robust
estimators which Huber calls "M-estimators", M for maximum-likelihood
type (see Huber, 1972). $T$ is said to be an M-estimator corresponding
to a function $\phi$ if $T$ is a solution to $\sum \phi(X_i - T) = 0$. Each choice of
$\phi$ determines an estimator; if $\phi = p'/p$, $T$ is the maximum likelihood
estimator for the location parameter of the population with density $p(t - \theta)$.
As the first appearance of these estimators in the context of robustness
seems to be in the work of Jeffreys after 1920 (see Jeffreys (1932) and
(1939) in particular), and as this work is outside the scope of this study,
we shall not dwell on this subject. However, we cannot resist calling
attention to an early reference in which the class of M-estimators is
introduced and their consistency claimed.
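
The defining equation of an M-estimator can be solved numerically, and a minimal sketch may help fix ideas. It assumes the chosen function is odd and nondecreasing, so that the sum is monotone in T and bisection applies. The function names, the clipped (Huber-type) function with cutoff 1.5, and the sample values are hypothetical choices for illustration, not anything given in the works cited here.

```python
def m_estimate(xs, phi, tol=1e-10):
    """Solve sum(phi(x - T)) = 0 for T by bisection.

    Assumes phi is odd and nondecreasing, so the score is nonincreasing
    in T and changes sign between min(xs) and max(xs).
    """
    lo, hi = min(xs), max(xs)

    def score(t):
        return sum(phi(x - t) for x in xs)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid      # root lies to the right of mid
        else:
            hi = mid      # root lies at or to the left of mid
    return 0.5 * (lo + hi)


def identity(e):
    """phi(e) = e reproduces the arithmetic mean."""
    return e


def clipped(e, k=1.5):
    """Huber-type phi: linear near zero, constant beyond +/- k (assumed cutoff)."""
    return max(-k, min(k, e))


# Hypothetical observations with one outlying value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 14.0]
t_mean = m_estimate(data, identity)
t_clipped = m_estimate(data, clipped)
```

With the identity function the solution is the arithmetic mean of the data. The clipped function limits the pull of the outlying value, so its solution sits closer to the bulk of the observations.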

In a paper examining the various "proofs" of the method of least
squares, Ellis (1844) began with Gauss's first proof. Letting the $x_i$
denote observed values, $a$ the quantity to be estimated, and $e_i = x_i - a$,
Ellis questioned Gauss's a priori designation of the arithmetic mean (the
solution to $\sum (x_i - a) = 0$) as the most probable value.

"It [the arithmetic mean] is not the only rule to which these
considerations might lead us. For not only is $\sum e = 0$ ultimately,
but $\sum fe = 0$, where $fe$ is any function such that $fe = -f(-e)$;
and therefore we should have

$$\sum f(x - a) = 0,$$

as an equation which ultimately would give the true value of x when
the number of observations increases sine limite, and which therefore
for a finite number of observations may be looked on in precisely the
same way as the equation which expresses the rule of the arithmetic
mean. There is no discrepancy between these two results. At the limit
they coincide: short of the limit both are approximations to the
truth. Indeed we might form some idea how far the action of fortuitous
causes had disappeared from a given series of observations by assigning
different forms of f, and comparing the different values thus found
for a.

"No satisfactory reason can be assigned why, setting aside mere
convenience, the rule of the arithmetic mean should be singled out
from other rules which are included in the general equation $\sum f(x - a) = 0$".

Thus Ellis has claimed (without proof or regularity conditions)
the consistency of M-estimators*, and even suggested the class may be useful
for judging to what degree an estimated value depends on the choice of
estimator, a stability test. Of course Ellis was not really concerned with
robustness, only with illuminating the arbitrary nature of Gauss's proof,
but his comments are of interest nonetheless.

* Huber proved this in Huber (1964).

A Note on the References


In addition to those works cited below, many other works were
consulted for references and general information. The information on
the life of Simon Newcomb came principally from the Encyclopedia
Britannica, the International Encyclopedia of the Social Sciences, and
Newcomb's autobiography (1903). The information on the life of Percy
Daniell came from various editions of Who's Who and American Men of
Science, and Stewart (1947). Merriman's (1872) bibliography on least
squares was quite useful for the period prior to 1877. I would also like
to thank William Kruskal, Churchill Eisenhart, and Oscar B. Sheynin for
a number of references and helpful comments. A good bibliography of
work since 1920 can be found in H. A. David's Order Statistics (1970).